Abstract


In conversation, uptake happens when a 
speaker builds on the contribution of their in- 
terlocutor by, for example, acknowledging, re- 
peating or reformulating what they have said. 
In education, teachers’ uptake of student con- 
tributions has been linked to higher student 
achievement. Yet measuring and improving 
teachers’ uptake at scale is challenging, as ex- 
isting methods require expensive annotation 
by experts. We propose a framework for com- 
putationally measuring uptake, by (1) releas- 
ing a dataset of student-teacher exchanges ex- 
tracted from US math classroom transcripts 
annotated for uptake by experts; (2) formal- 
izing uptake as pointwise Jensen-Shannon Di- 
vergence (PJSD), estimated via next utterance 
classification; (3) conducting a linguistically- 
motivated comparison of different unsuper- 
vised measures and (4) correlating these mea- 
sures with educational outcomes. We find 
that although repetition captures a significant 
part of uptake, PJSD outperforms repetition- 
based baselines, as it is capable of identifying 
a wider range of uptake phenomena like ques- 
tion answering and reformulation. We apply 
our uptake measure to three different educa- 
tional datasets with outcome indicators. Un- 
like baseline measures, PJSD correlates signifi- 
cantly with instruction quality in all three, pro- 
viding evidence for its generalizability and for 
its potential to serve as an automated professional
development tool for teachers.


1 Introduction 


Building on the interlocutor’s contribution via, for 
example, acknowledgment, repetition or elabora- 
tion (Figure 1), is known as uptake and is key to 
a successful conversation. Uptake makes an inter- 
locutor feel heard and fosters a collaborative inter- 
action (Collins, 1982; Clark and Schaefer, 1989), 


¹Code and annotated data: https://github.com/ddemszky/conversational-uptake
Figure 1: Example student utterance s and possible
teacher replies t, illustrating different uptake strategies.


which is especially important in contexts like edu- 
cation. Teachers’ uptake of student ideas promotes 
dialogic instruction by amplifying student voices 
and giving them agency in the learning process, un- 
like monologic instruction where teachers lecture 
at students (Bakhtin, 1981; Wells, 1999; Nystrand 
et al., 1997). Despite extensive research showing 
the positive impact of uptake on student learning 
and achievement (Brophy, 1984; O’Connor and 
Michaels, 1993; Nystrand et al., 2003), measuring 
and improving teachers’ uptake at scale is challeng- 
ing as existing methods require manual annotation 
by experts and are prohibitively resource-intensive. 

We introduce a framework for computationally 
measuring uptake. First, we create and release 
a dataset of 2246 student-teacher exchanges ex- 
tracted from US elementary math classroom tran- 
scripts, each annotated by three domain experts for 
teachers’ uptake of student contributions. 

We take an unsupervised approach to measure 
uptake in order to encourage domain-transferability 
and account for the fact that large amounts of la- 
beled data are not possible in many contexts due 
to data privacy reasons and/or limited resources. 


We conduct a careful analysis of the role of repeti- 
tion in uptake by measuring utterance overlap and 
similarity. We find that the proportion of student 
words repeated by the teacher (%-IN-T) captures 
a large part of uptake, and that surprisingly, word- 
level similarity measures consistently outperform 
sentence-level similarity measures, including ones 
involving sophisticated neural models. 

To capture uptake phenomena beyond repetition 
and in particular those relevant to teaching (e.g. 
question answering), we formalize uptake as a 
measure of the reply’s dependence on the source 
utterance. We quantify dependence via pointwise 
Jensen-Shannon divergence (PJSD), which cap- 
tures how easily someone (e.g., a student) can 
distinguish the true reply from randomly sampled 
replies. We show that PJSD can be estimated via 
cross-entropy loss obtained from next utterance 
classification (NUC). 

We train a model by fine-tuning BERT-base 
(Devlin et al., 2019) via NUC on a large, combined 
dataset of student-teacher interactions and Switch- 
board (Godfrey and Holliman, 1997). We show that 
scores obtained from this model significantly out- 
perform our baseline measures. Using dialog act 
annotations on Switchboard, we demonstrate that 
PJSD is indeed better at capturing phenomena such 
as reformulation, question answering and collabora- 
tive completion than %-IN-T, our best-performing 
baseline. Our manual analysis also shows qualita- 
tive differences between the models: the examples 
where PJSD outperforms %-IN-T are enriched by 
teacher prompts for elaboration, an exemplar for 
dialogic instruction (Nystrand et al., 1997). 

Finally, we find that our PJSD measure shows
a significant linear correlation with outcomes 
such as student satisfaction and instruction quality 
across three different datasets of student-teacher 
interactions: the NCTE dataset (Kane et al., 2015), 
a one-on-one online tutoring dataset, and the 
SimTeacher dataset (Cohen et al., 2020). These 
results provide evidence for the generalizability of 
our PJSD measure and for its potential to serve as 
an automated tool to give feedback to teachers. 


2 Background on Uptake 


Uptake has several linguistic and social func- 
tions. (1) It creates coherence between two utter- 
ances, helping structure the discourse (Halliday and 
Hasan, 1976; Grosz et al., 1977; Hobbs, 1979). (2) 
It is a mechanism for grounding, i.e. demonstrat- 


ing understanding of the interlocutor’s contribu- 
tion by accepting it as part of the common ground 
(shared set of beliefs among interlocutors) (Clark 
and Schaefer, 1989). (3) It promotes collaboration 
with the interlocutor by sharing the floor with them 
and indicating what they have said is important 
(Bakhtin, 1981; Nystrand et al., 1997). 


There are multiple linguistic strategies for up- 
take, such as acknowledgment, collaborative com- 
pletion, repetition, and question answering — see 
Figure 1 for a non-exhaustive list. A speaker can 
use multiple strategies at the same time; for example,
t3 in Figure 1 includes both acknowledgment
and repetition. Different strategies can represent 
lower or higher uptake depending on how effec- 
tively they achieve the aforementioned functions 
of uptake. For example, Tannen (1987) argues 
that repetition is a highly pervasive and effective 
strategy for ratifying listenership and building a 
coherent discourse. In education, high uptake has 
been defined as cases where the teacher follows 
up on the student’s contribution via a question or 
elaboration (Collins, 1982; Nystrand et al., 1997). 


We build on this literature from discourse analy- 
sis and education to build our dataset, to develop 
our uptake measure and to compare the ability of 
different measures to capture key uptake strategies. 


3 A New Educational Uptake Dataset


Despite the substantial literature on the functions 
of uptake, we are not aware of a publicly available 
dataset labeled for this phenomenon. To address 
this, we recruit domain experts (math teachers and 
raters trained in classroom observation) to anno- 
tate a dataset of exchanges between students and 
teachers. The exchanges are sampled from tran- 
scripts of 45-60 minute long 4th and 5th grade 
elementary math classroom observations collected 
by the National Center for Teacher Effectiveness 
(NCTE) between 2010-2013 (Kane et al., 2015). 
The transcripts represent data from 317 teachers 
across 4 school districts in New England that serve 
largely low-income, historically marginalized stu- 
dents. Transcripts are fully anonymized: student 
and teacher names are replaced with terms like 
“Student”, “Teacher” or “Mrs. H”.²


²Parents and teachers gave consent for the study (Harvard
IRB #17768), and for de-identified data to be retained and 
used in future research. The transcripts were anonymized at 
the time they were created. 


Preparing utterance pairs. We prepare a 
dataset of utterance pairs (S, T), where S is a student
utterance and T is a subsequent teacher utterance.
The concept of uptake presupposes that there
is something to be taken up; in our case that the 
student utterance has substance. For example, short 
student utterances like “yes” or “one-third” do not 
present many opportunities for uptake. Based on 
our pilot annotations, these utterances are difficult 
for even expert annotators to label. Therefore, we 
only keep utterance pairs where S contains at least 
5 tokens, excluding punctuation. We also remove 
all utterance pairs where the utterances contain an 
[Inaudible] marker, indicating low audio quality. 
Out of the remaining 55k (S, T) pairs, we sample
2246 for annotation.³
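The two filters described above can be sketched as follows. This is a minimal illustration rather than the authors' preprocessing code: the whitespace tokenizer, the helper names `content_tokens` and `keep_pair`, and the case-insensitive marker check are all assumptions.

```python
import string

MIN_STUDENT_TOKENS = 5      # threshold stated in the text
INAUDIBLE = "[inaudible]"   # low-audio-quality marker, matched case-insensitively (assumption)

def content_tokens(utterance):
    """Whitespace tokens with surrounding punctuation stripped; empty tokens dropped."""
    stripped = (tok.strip(string.punctuation) for tok in utterance.split())
    return [tok for tok in stripped if tok]

def keep_pair(student, teacher):
    """True if the (S, T) pair passes both filters described above."""
    if INAUDIBLE in student.lower() or INAUDIBLE in teacher.lower():
        return False
    return len(content_tokens(student)) >= MIN_STUDENT_TOKENS

pairs = [
    ("yes", "Good."),                                       # too short
    ("Because the base of it is a hexagon.", "Student K?"),  # kept
    ("I think it is [Inaudible] plus ten.", "Say that again?"),  # inaudible
]
kept = [p for p in pairs if keep_pair(*p)]
```

Only the second pair survives: the first student utterance has a single token, and the third contains the marker.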


Annotation. Given that uptake is a subjective 
and heterogeneous construct, we relied heavily on 
domain-expertise and took several other quality as- 
surance steps for the annotation. As a result, the 
annotation took six months to develop and com- 
plete, longer than most other annotations in NLP 
for a similar data size (~2k examples). 

Our annotation framework for uptake is designed 
by experts in math quality instruction, including 
our collaborators, math teachers and raters for the 
Mathematical Quality Instruction (MQI) coding 
instrument, used to assess math instruction (Teaching
Project, 2011). In the annotation interface,
raters can see (1) the utterance pair (S, T), (2) the
lesson topic, which is manually labeled as part of
the original dataset, and (3) two utterances immediately
preceding (S, T) for context. Annotators are
asked to first check whether (S, T) relates to math;
for example, “Can I go to the bathroom?” is unrelated to
math. If both S and T relate to math, raters are
asked to select among three labels: “low”, “mid” 
and “high”, indicating the degree to which a teacher 
demonstrates that they are following what the stu- 
dent is saying or trying to say. The annotation 
framework is included in Appendix A. 

We recruited expert raters (with experience in 
teaching and classroom observation) whose demographics
were representative of the US K-12 teacher
population. We followed standard practices in ed- 
ucation for rater training and calibration. We con- 
ducted several pilot annotation rounds (5+ rounds 


³To enable potential analyses on the temporal dynamics
of uptake, we randomly sampled 15 transcripts where we
annotate all (S, T) pairs (constituting 29% of our annotations).
The rest of the pairs are sampled from the remaining data. 


with a subset of raters, 2 rounds involving all 13 
raters), quizzes for raters, thorough documentation 
with examples, and meetings with all raters. After 
training raters, we randomly assign each example 
to three raters. 


Post-processing and rater agreement. Table 1
includes a sample of our annotated data. Inter-rater
agreement for uptake is Spearman ρ = .474 (Fleiss
κ = .286⁴), measured by (1) excluding examples
where at least one rater indicated that the utterance
pair does not relate to math⁵; (2) converting raters’
scores into numbers (“low”: 0, “mid”: 1, “high”:
2); (3) z-scoring each rater’s scores; (4) computing
a leave-out Spearman ρ for each rater by correlating
their judgments with the average judgments of the
other two raters; and (5) taking the average of the
leave-out correlations across raters. Our interrater
agreement values are comparable to those obtained in
widely-used classroom observation protocols such
as MQI and the Classroom Assessment Scoring
System (CLASS) (Pianta et al., 2008) that include
parallel measures to our uptake construct (see Kelly
et al. (2020) for a summary).⁶ We obtain a single
label for each example by averaging the z-scored 
judgments across raters. 
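Steps (2) to (5) of the agreement computation can be sketched as below. The sketch makes a simplifying assumption that is not true of the actual annotation setup: it treats the data as if the same three raters had labeled every example, so that "the other two raters" is the same pair throughout. The function names are hypothetical, and Spearman's ρ is implemented by hand (Pearson correlation of average ranks) to keep the example dependency-free.

```python
from statistics import mean, pstdev

LABEL_TO_SCORE = {"low": 0, "mid": 1, "high": 2}

def zscore(values):
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]  # assumes the rater used more than one label

def average_ranks(values):
    """Ranks starting at 1; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

def leave_out_agreement(ratings):
    """ratings: dict mapping rater -> list of labels, one per example, same order."""
    z = {r: zscore([LABEL_TO_SCORE[l] for l in labels]) for r, labels in ratings.items()}
    rhos = []
    for rater in z:
        others = [z[o] for o in z if o != rater]
        others_avg = [mean(col) for col in zip(*others)]
        rhos.append(spearman(z[rater], others_avg))
    return mean(rhos)
```

Averaging the same z-scored judgments across raters, instead of correlating them, gives the single per-example label described at the end of the paragraph.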


4 Uptake as Overlap & Similarity 


As we see in Table 1, examples labeled for high 
uptake tend to have overlap between S and T; this
is expected, since incorporating the previous utter- 
ance in some form is known to be an important as- 
pect of uptake (Section 2). Therefore, we begin by 
carefully analyzing repetition and defer discussion 
of more complex uptake phenomena to Section 5. 
To accurately quantify repetition-based uptake, 
we evaluate a range of metrics and surprisingly find 
that word overlap based measures correlate signif- 
icantly better with uptake annotations than more 
sophisticated, utterance-level similarity measures.⁷


⁴We prefer to use correlations because kappa has undesirable
properties (see Delgado and Tibau, 2019) and correlations
are more interpretable and directly comparable to our models’ 
results (see later sections). 

⁵This step is motivated by widely used education observation
protocols such as MQI, which also clearly separate on- vs
off-task instruction.

⁶High interrater variability, especially when it comes
to ratings of teacher quality, is widely documented by
gold standard studies in the field of education (see Cohen and 
Goldhaber (2016) for a summary). 

⁷We focus on unsupervised methods that enable scalability
and domain-generalizability; please see Appendix B for
supervised baselines.


[Table 2 values; the caption appears with the method descriptions in Section 4.1]

Model                        ρ            95% CI
LCS                          .283         [.240, .329]
%-IN-T                       .523***      [.488, .559]
%-IN-S                       (illegible)  [.399, (illegible)]
JACCARD                      (illegible)  (illegible)
BLEU                         .510         (illegible)
GLOVE [ALIGNED]              .518         [.483, .550]
GLOVE [UTT]                  (illegible)  [.378, .465]
SENTENCE-BERT                .390         [.350, .432]
UNIVERSAL SENTENCE ENCODER   .448         [.408, .486]


Example                                                      Uptake

S: ’Cause you took away 10 and 70 minus 10 is 60.            high
T: Why did we take away 10?

S: There’s not enough seeds.                                 high
T: There’s not enough seeds. How do you know
   right away that 128 or 132 or whatever
   it was you got doesn’t make sense?

S: Teacher L, can you change your dimensions                 mid
   like 3-D and stuff for your bars?
T: You can do 2-D or 3-D, yes. I already said that.

S: The higher the number, the smaller it is.                 mid
T: You got it. That’s a good thought.

S: An obtuse angle is more than 90 degrees.                  low
T: Why don’t we put our pencils down and just do
   some brainstorming, and then we’ll go back
   through it?

S: Because the base of it is a hexagon.                      low
T: Student K?


Table 1: Examples from our annotated data, showing 
the majority label for each example. 


4.1 Methods 


We use several algorithms to better understand if 
word- or utterance-level similarity is a better mea- 
sure of uptake. For each token-based algorithm, 
we experiment with several different choices for 
pre-processing as a way to get the best possible 
baselines to compare to. We include symbols for 
the set of choices yielding best performance : re- 
moving punctuation #, removing stopwords using 
NLTK (Bird, 2006) ©, and stemming via NLTK’s 
SnowballStemmer f. 


String- and token-overlap. 


LCS: Longest Common Subsequence.


%-IN-T: Fraction of tokens from S that are also 
in T (Miller and Beebe-Center, 1956). [@ © +] 


%-IN-S: Fraction of tokens from T that are also 
in S. [@@] 


JACCARD: Jaccard similarity (Niwattanakul et al., 
2013). [@@] 


BLEU: BLEU score (Papineni et al., 2002) for up 
to 4-grams. We use S as the reference and T as 
the hypothesis.[@ ® +] 
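The set-based overlap measures listed above are simple enough to sketch directly. The code below is an illustration under stated assumptions, not the authors' implementation: the tokenizer is a plain whitespace split, `STOPWORDS` is a tiny made-up list standing in for NLTK's, stemming is omitted, and %-IN-T is computed over token occurrences (the text does not say whether types or tokens are counted).

```python
import string

# Hypothetical stopword list for illustration only; the paper uses NLTK's list.
STOPWORDS = {"a", "an", "the", "is", "it", "of", "to", "do", "we", "you", "did", "and"}

def tokens(utterance, drop_stopwords=False):
    toks = [t.strip(string.punctuation).lower() for t in utterance.split()]
    toks = [t for t in toks if t]
    if drop_stopwords:
        toks = [t for t in toks if t not in STOPWORDS]
    return toks

def pct_in_t(s, t, **kw):
    """%-IN-T: fraction of student tokens that also occur in the teacher reply."""
    s_toks, t_set = tokens(s, **kw), set(tokens(t, **kw))
    return sum(tok in t_set for tok in s_toks) / len(s_toks) if s_toks else 0.0

def pct_in_s(s, t, **kw):
    """%-IN-S: fraction of teacher tokens that also occur in the student utterance."""
    t_toks, s_set = tokens(t, **kw), set(tokens(s, **kw))
    return sum(tok in s_set for tok in t_toks) / len(t_toks) if t_toks else 0.0

def jaccard(s, t, **kw):
    a, b = set(tokens(s, **kw)), set(tokens(t, **kw))
    return len(a & b) / len(a | b) if a | b else 0.0

# Shortened version of an exchange from Table 1.
s = "There's not enough seeds."
t = "There's not enough seeds. How do you know that doesn't make sense?"
```

For this pair, %-IN-T is 1.0 because every student token is repeated, while %-IN-S and JACCARD are both 1/3 because the teacher adds eight new tokens. This illustrates why measures that penalize additional teacher words can score a clear case of uptake low.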


Embedding-based similarity. For the word 
vector-based metrics, we use 300-dimensional 
GloVe vectors (Pennington et al., 2014) pretrained 
on 6B tokens from Wikipedia 2014 and the Giga- 
word 5 corpus (Parker et al., 2011). 


Table 2: Results from our baseline measures. Asterisks 
indicate that %-IN-T significantly outperforms GLOVE 
[ALIGNED] (p < 0.001), measured by a paired bootstrap
test, comparing the difference between the ρ obtained
by %-IN-T and the one by GLOVE [ALIGNED]
across 1000 iterations, then using a t-test. 


GLOVE [ALIGNED]: Average pairwise cosine 
similarity of word embeddings between tokens 
from S and its most similar token in T. [@] 


GLOVE [UTT]: Cosine similarity of utterance
vectors representing S and T. Utterance vectors
are obtained by averaging word vectors from S
and from T. [@® ]


SENTENCE-BERT: Cosine similarity of utterance
vectors representing S and T, obtained using a
pre-trained Sentence-BERT model for English
(Reimers and Gurevych, 2019).⁸


UNIVERSAL SENTENCE ENCODER: Inner 
product of utterance vectors representing S and T, 
obtained using a pre-trained Universal Sentence 
Encoder for English (Cer et al., 2018). 
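The difference between the word-level GLOVE [ALIGNED] measure and the utterance-level GLOVE [UTT] measure can be made concrete with a toy example. The three-dimensional vectors in `EMB` are made-up stand-ins for 300-dimensional GloVe embeddings, and the function names are hypothetical; only the two aggregation schemes follow the definitions above.

```python
import math

# Made-up 3-d vectors standing in for pretrained GloVe embeddings.
EMB = {
    "add":      [1.0, 0.0, 0.0],
    "addition": [0.9, 0.1, 0.0],
    "nine":     [0.0, 1.0, 0.0],
    "times":    [0.0, 0.0, 1.0],
    "repeated": [0.1, 0.0, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def glove_aligned(s_tokens, t_tokens):
    """Mean over S tokens of the cosine similarity to the most similar T token."""
    best = [max(cosine(EMB[s], EMB[t]) for t in t_tokens) for s in s_tokens]
    return sum(best) / len(best)

def glove_utt(s_tokens, t_tokens):
    """Cosine similarity of the averaged word vectors of S and of T."""
    def avg(toks):
        return [sum(dim) / len(toks) for dim in zip(*(EMB[tok] for tok in toks))]
    return cosine(avg(s_tokens), avg(t_tokens))

s_toks = ["nine", "times", "add"]
t_toks = ["repeated", "addition"]
aligned = glove_aligned(s_toks, t_toks)
utt = glove_utt(s_toks, t_toks)
```

The aligned variant behaves like a soft version of %-IN-T: each student token is matched to its best counterpart in the reply, so unmatched student tokens ("nine" here) pull the score down individually instead of being averaged away.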


4.2 Results 


We compute correlations between model scores 
and human labels via Spearman rank order correlation
ρ. We perform bootstrap sampling (for 1000
iterations) to compute 95% confidence intervals. 
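A bootstrap confidence interval for Spearman's ρ can be sketched as follows. The text does not specify which bootstrap variant is used, so the percentile interval below is an assumption, as are the function names and the choice to skip degenerate resamples in which one variable is constant.

```python
import random
from statistics import mean

def average_ranks(values):
    """Ranks starting at 1; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

def bootstrap_ci(scores, labels, iterations=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for Spearman's rho (resampling examples with replacement)."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    for _ in range(iterations):
        idx = [rng.randrange(n) for _ in range(n)]
        xs, ys = [scores[i] for i in idx], [labels[i] for i in idx]
        if len(set(xs)) > 1 and len(set(ys)) > 1:  # skip constant resamples
            stats.append(spearman(xs, ys))
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi
```

Calling `bootstrap_ci(model_scores, human_labels)` returns the lower and upper ends of the interval; both argument names are placeholders for per-example model scores and averaged human labels.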

The results are shown in Table 2. Overall, 
we find that token-based measures outperform 
utterance-based measures, with %-IN-T (ρ = .523),
GLOVE [ALIGNED] (ρ = .518) (a soft word overlap
measure) and BLEU (ρ = .510) performing
the best. Even embedding-based algorithms that 
are computed at the utterance-level do not outper- 
form %-IN-T, a simple word overlap baseline. It 
is noteworthy that all measures have a significant 
correlation with human judgments. 


⁸https://github.com/UKPLab/sentence-transformers


The surprisingly strong performance of %-IN-T,
GLOVE [ALIGNED] and BLEU provides further
evidence that the extent to which T repeats words
from S is important for uptake (Tannen, 1987),
especially in the context of teaching. The fact that
removing stopwords helps these measures suggests
that the repetition of function words is less important
for uptake, an interesting contrast to linguistic
style coordination, in which function words play a
key role (Danescu-Niculescu-Mizil and Lee, 2011).
Moreover, the number of words T adds beyond
the words from S also seems relatively irrelevant,
based on the lower performance of the measures
that penalize T for containing words that are not in S;
the examples in Table 1 also support this result.


5 Uptake as Dependence 


Now we introduce our main uptake measure, used 
to capture a broader range of uptake phenomena 
beyond repetition including, e.g., acknowledgment 
and question answering (Section 2). We formalize 
uptake as dependence of T on S, captured by the
Jensen-Shannon Divergence, which quantifies the
extent to which we can tell whether T is a response
to S or a random response T'. If we cannot
tell the difference between T and T', we argue that
there can be no uptake, as T fails all three functions
of coherence, grounding and collaboration.

We can formally define the dependence for a
single teacher-student utterance pair (s, t) in terms
of a pointwise variant of JSD (PJSD) as

\[ \mathrm{pJSD}(t, s) := -\frac{1}{2}\Big( \log P(Z=1 \mid M=t, s) + \mathbb{E}\big[\log\big(1 - P(Z=1 \mid M=T', s)\big)\big] \Big) + \log(2) \qquad (1) \]

where (S, T) is a teacher-student utterance pair,
T' is a randomly sampled teacher utterance that is
independent of S, and M := ZT + (1−Z)T' is a
mixture of the two with a binary indicator variable
Z ~ Bern(p=0.5).

This pointwise measure relates to the standard
JSD for T|S=s and T' by taking
expectations over the teacher utterance via
E[pJSD(T, s) | S=s] = JSD(T|S=s || T'). We
consider the pointwise variant for the rest of the 
section, as we are interested in a measure of depen- 
dence between a specific (t, s) rather than one that 
is averaged over multiple teacher utterances. 


5.1 Next Utterance Classification 


The definition of PJSD naturally suggests an estimator
based on the next utterance classification
task, a task previously used in neighboring NLP
areas like dialogue generation and discourse coherence.
We fine-tune a pre-trained BERT-base model
(Devlin et al., 2019) on a dataset of (S, T) pairs
to predict if a specific (s, t) is a true pair or not
(i.e., whether t came from T or T'). The objective
function is cross-entropy loss, computed over the
output of the final classification layer that takes in
the last hidden state of t. Let Z be a binary indicator
variable representing the model’s prediction, and let
f_θ(t, s) denote the predicted probability that Z = 1.
Then, the cross-entropy loss for identifying z is


\[ L(t, s) = -\log f_\theta(t, s) - \mathbb{E}\big[\log\big(1 - f_\theta(T', s)\big)\big] \qquad (2) \]

which can be used directly as an estimator for the
log-probability terms in Equation 1,

\[ \widehat{\mathrm{pJSD}}(t, s) = \frac{1}{2} L(t, s) + \log 2. \qquad (3) \]


Standard variational arguments (Nowozin et al.,
2016) show that any classifier f_θ forms a lower
bound on the JSD,

\[ \mathrm{JSD}(T \mid S=s \,\|\, T') \ge \mathbb{E}\big[\widehat{\mathrm{pJSD}}(T, s) \mid S=s\big]. \]

Thus, our overall procedure is to fit f_θ(t, s) by maximizing
E[pJSD(t, s)] over our dataset and then
use f_θ(t, s) (a monotone function of pJSD(t, s))
as our pointwise measure of dependence.
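To make the loss in Equation 2 concrete, the sketch below computes a Monte Carlo version of it from classifier probabilities, replacing the expectation over T' with an average over a handful of sampled negative replies. The probabilities are made-up numbers and `nuc_loss` is a hypothetical helper; no model is trained here.

```python
import math

def nuc_loss(f_true, f_negatives):
    """Monte Carlo version of Equation 2.

    f_true:      classifier probability f_theta(t, s) that the observed reply is the true one
    f_negatives: probabilities f_theta(t', s) for randomly sampled replies t'
    """
    expected_neg = sum(math.log(1.0 - f) for f in f_negatives) / len(f_negatives)
    return -math.log(f_true) - expected_neg

# Classifier at chance: it cannot tell the true reply from random ones.
chance = nuc_loss(0.5, [0.5, 0.5, 0.5])
# Classifier that separates the true reply from the sampled ones well.
confident = nuc_loss(0.95, [0.05, 0.10, 0.02])
```

At chance the loss equals 2 log 2, and it shrinks toward zero as the true reply becomes easier to distinguish from random replies. At scoring time only f_θ(t, s) for the observed pair is needed, which is why the text uses it directly as the pointwise measure.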


Training data. We use (S, T) pairs from three
sources to form our training data: the NCTE dataset
(Kane et al., 2015) (Section 3), Switchboard (Godfrey
and Holliman, 1997) and a one-on-one online
tutoring dataset (Section 6). We use a combination
of datasets instead of one dataset in order to
support the generalizability of the model. After filtering
out examples with S < 5 tokens or [Inaudible]
markers (Section 3), our resulting dataset consists
of 259k (S, T) pairs. For each (s, t) pair, we randomly
select 3 negative (s, t') pairs from the same
source dataset, yielding 777k examples.⁹
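The negative sampling step can be sketched as below. This is an illustration under assumptions rather than the authors' data pipeline: negatives are drawn without replacement from the other replies in the same source, the pair's own reply is excluded by position, and the function returns positives and negatives together with a binary label. The function name and the toy source names are hypothetical.

```python
import random

def build_nuc_examples(pairs_by_source, negatives_per_pair=3, seed=0):
    """Return (s, t, label) triples: label 1 for true pairs, 0 for sampled negatives.

    pairs_by_source: dict mapping a source name to a list of (s, t) pairs.
    Assumes every source contains more than `negatives_per_pair` pairs.
    """
    rng = random.Random(seed)
    examples = []
    for source, pairs in pairs_by_source.items():
        replies = [t for _, t in pairs]
        for i, (s, t) in enumerate(pairs):
            examples.append((s, t, 1))
            # Candidate negatives: replies from the same source, minus this pair's own.
            # The list copy is O(n) per pair, acceptable for a sketch only.
            candidates = replies[:i] + replies[i + 1:]
            for t_neg in rng.sample(candidates, negatives_per_pair):
                examples.append((s, t_neg, 0))
    return examples

toy = {
    "ncte": [(f"s{i}", f"t{i}") for i in range(5)],
    "swbd": [(f"x{i}", f"y{i}") for i in range(5)],
}
examples = build_nuc_examples(toy)
```

Sampling within the same source keeps the classifier from solving the task by recognizing the dataset (for example, classroom versus telephone speech) instead of the relation between s and t.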


Parameter settings. We fine-tune our model for
1 epoch to avoid overfitting, with a batch size of
32 × 2 gradient accumulation steps, max length of


⁹We do not split the data into training and validation sets,
as we found that using predictions on the training data vs those
on the test data as our uptake measure yields similar results, so
we opted for maximizing training data size.


Model     ρ         95% CI
%-IN-T    .523      [.488, .559]
PJSD      .540***   [.505, .574]


Table 3: Results from the PJSD model. The asterisks, 
calculated as in Table 2, indicate that the difference be- 
tween the two models’ performance is significant. 


120 tokens for S and T each (the rest is truncated), 
learning rate of 6.24e-5 with linear decay and the 
AdamW optimizer (Loshchilov and Hutter, 2017). 
Training took about 13 hours on a single TitanX GPU.


5.2 Results & Analysis 


Table 3 shows that the PJSD model (ρ = .540) significantly
outperforms %-IN-T. Our rough estimate
of the upper bound of rater agreement (ρ = .539,
obtained from a pilot annotation where all 13 raters
rated 70 examples) indicates that our best model’s
scores are in a similar range as human agreement.¹⁰

Table 4 includes illustrative examples for model 
predictions. Our qualitative comparison of PJSD 
and %-IN-T indicates that (1) the capability of PJSD 
to differentiate between more and less important 
words in terms of uptake (Examples 1 and 6) ac- 
counts for many cases where PJSD is more accurate 
than %-IN-T, (2) neither model is able to capture 
rare and semantically deep forms of uptake (Exam- 
ple 3), (3) PJSD generally gives higher scores than 
%-IN-T to coherent responses with limited word 
overlap (Example 5). 

Now we turn to our motivating goals for propos- 
ing PJSD and quantitatively analyze its ability to 
capture more sophisticated forms of uptake.


Comparison of linguistic phenomena. To understand
if there is a pattern explaining PJSD’s better
performance, we quantify the occurrence of different
linguistic phenomena for examples where
PJSD outperforms %-IN-T. Concretely, we com- 
pute the residuals for each model, regressing the 
human labels on their predictions. Then, we take 
those examples where the difference between the 
two models’ residuals is 1.5 standard deviations 
above the mean difference between their residu- 
als. We label teacher utterances in these examples 


¹⁰Human agreement and model scores are not directly comparable.
The human agreement values (as reported here for 13
raters and in Section 3 for 3 raters) are averaged leave-out es- 
timates across raters (skewed downward). The models’ scores 
represent correlations with an averaged human score, which 
smooths over the interrater variance of 3 raters. 


[Figure 2 plot: one row each for answer***, reformulation***, collaborative completion***, acknowledgment*** and repetition***; horizontal axis δ, running from “%-IN-T is higher” (left) to “PJSD is higher” (right).]


Figure 2: The difference (δ) between the scores from
%-IN-T and PJSD for five uptake phenomena labeled 
in Switchboard. Asterisks indicate significance (***: 
p < 0.001), estimated via a median test. 


for four linguistic phenomena associated with up- 
take and good teaching (elaboration prompt, re- 
formulation, collaborative completion, and answer 
to question), allowing multiple labels (e.g. elaboration
prompt and completion often co-occur).¹¹
As Table 5 shows, elaboration prompts, which are 
exemplars of high uptake in teaching (Nystrand 
et al., 1997) are significantly more likely to occur 
in this set — suggesting that there is a qualitative 
difference between what these models capture that 
is relevant for teaching. We do not find a signifi- 
cant difference in the occurrence of reformulations, 
collaborative completions and answers between the 
two sets, possibly due to the small sample size 
(n=67). To see whether these differences are sig- 
nificant on a larger dataset, we now turn to the 
Switchboard dialogue corpus. 
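The selection of examples where one model clearly outperforms the other can be sketched as follows. The text leaves some details open, so the sketch makes explicit assumptions: residuals come from a simple least-squares fit of the human labels on each model's scores, "difference between residuals" is read as the difference between absolute residuals, and the standard deviation is the population standard deviation. The function names are hypothetical.

```python
from statistics import mean, pstdev

def residuals(predictions, labels):
    """Residuals of a simple least-squares fit: labels ~ intercept + slope * predictions."""
    mx, my = mean(predictions), mean(labels)
    slope = (sum((x - mx) * (y - my) for x, y in zip(predictions, labels))
             / sum((x - mx) ** 2 for x in predictions))
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(predictions, labels)]

def where_a_beats_b(pred_a, pred_b, labels, threshold_sd=1.5):
    """Indices where model B's absolute residual exceeds model A's by more than
    the mean gap plus `threshold_sd` standard deviations."""
    res_a, res_b = residuals(pred_a, labels), residuals(pred_b, labels)
    gap = [abs(rb) - abs(ra) for ra, rb in zip(res_a, res_b)]
    cutoff = mean(gap) + threshold_sd * pstdev(gap)
    return [i for i, g in enumerate(gap) if g > cutoff]

labels = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
pred_a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]   # tracks the labels perfectly
pred_b = [0, 1, 2, 3, 4, 5, 6, 7, 8, 0]   # badly off on the last example
selected = where_a_beats_b(pred_a, pred_b, labels)
```

In this toy case only the last example is selected. Swapping the two prediction arguments yields the complementary set, which is the comparison set used for the odds ratios in Table 5.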


Switchboard dialog acts. We take advantage of 
dialog act annotations on Switchboard (Jurafsky 
et al., 1997), to compare uptake phenomena cap- 
tured by %-IN-T and PJSD at a large scale. We iden- 
tify five uptake phenomena labeled in Switchboard 
and map them to SWBD-DAMSL tags: acknowl- 
edgment, answer, collaborative completion, refor- 
mulation and repetition (see details in Appendix C). 

We estimate scores for %-IN-T and PJSD for 
all utterance pairs (S, T) in Switchboard, filtering
out ones where S < 5 tokens. We apply our PJSD 
model from Section 5.1, which was partially fine- 
tuned on Switchboard. Since both measures are 


¹¹We label examples with above average uptake scores, as
there is no trivial interpretation for uptake strategies labeled 
on low-uptake examples. 


#  Example

1  S: i knew that eight was a composite number and -
   T: why? how? how did you know it was composite?

2  S: do you have to know division to do fractions?
   T: i would think - division, sometimes, yes, you do need to know division to do some
      types of fractions. when we get to putting your fraction in simplest forms, yes, you
      need to know division and multiplication facts. you know something else you can find
      that comes in fractions?

3  S: you put a one instead of a two.
   T: yes i did. thank you. you always correct me. that’s too high. let’s bring it down.
      how many times do you think, student d?

4  S: five, six, seven, eight, you take eight off.
   T: no, no, no equal pieces. right? okay so how many equal pieces do you need to make?

5  S: i can prove it that it’s three hundred.
   T: and you think it’s -?

6  S: oh, i see it. i see it.
   T: okay, now this is also another equivalent fraction. after you color, see if you see the
      equivalent fraction. let’s see what you’ve got, student y.

[The “Label (quartile)” column and the color-coded “Model predictions” columns (PJSD, %-IN-T) are only partly legible in the source: the surviving cell values are “mid”, “top”, “bottom” and “mid”, without row alignment.]


Table 4: Example model predictions, comparing the PJSD model to %-IN-T. All labels are converted to percentiles: 
top (75th), mid (25-75th) and bottom (25th). Green indicates correct predictions, red indicates predictions from 
the opposite quartile and grey indicates mid-range predictions. 


Label (odds ratio)              Example

elaboration prompt              S: so it means that the whole equation
(illegible)                        is only the same.
                                T: what does it mean? i still don’t
                                   understand what is it?

reformulation (2.6)             S: multiplication is like, say, for instance,
                                   nine times twenty. you just take - nine just
                                   nine times and add it up.
                                T: okay, so repeated addition.

answer (2.67)                   S: do we look at the d or the m first?
                                T: the m. what’s this called, that i’m writing?

collaborative completion (0)    S: we had to add twenty-four plus twenty-four.
                                T: because there are how many triangles?


Table 5: Examples for linguistic phenomena, manually 
labeled in the dataset where PJSD and %-IN-T make 
significantly different predictions. Parenthetical num- 
bers after the labels represent the odds ratio of exam- 
ples with this label occurring in the set where PJSD per- 
forms better over the set where %-IN-T performs better 
(*: p < 0.05, computed via a Fisher exact test). 


bounded, we quantile-transform the distribution of 
each measure to have a uniform distribution. For 
each uptake phenomenon, we compute the difference
(δ) between the median score from PJSD and
the median score from %-IN-T for all (S, T) pairs
where T is labeled for that phenomenon.
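The quantile transform and the median difference δ can be sketched as below. The rank-based transform is an assumption (the text does not say which implementation is used), the function names are hypothetical, and the scores are toy values.

```python
from statistics import median

def to_uniform(values):
    """Map each value to its empirical quantile in (0, 1]; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = ((i + j) / 2 + 1) / len(values)
        i = j + 1
    return out

def delta(pjsd_scores, pct_in_t_scores, is_phenomenon):
    """Median quantile of PJSD minus median quantile of %-IN-T on the flagged pairs."""
    q_pjsd, q_pct = to_uniform(pjsd_scores), to_uniform(pct_in_t_scores)
    picked = [i for i, flag in enumerate(is_phenomenon) if flag]
    return median(q_pjsd[i] for i in picked) - median(q_pct[i] for i in picked)

# Toy scores for four pairs; the first two are labeled with the phenomenon.
pjsd = [0.9, 0.8, 0.1, 0.2]
pct = [0.0, 0.1, 0.5, 0.6]
flags = [True, True, False, False]
d = delta(pjsd, pct, flags)
```

Transforming both measures over all pairs before comparing medians puts them on a common scale, so a positive δ means that PJSD ranks pairs with that phenomenon higher, relative to the rest of the corpus, than %-IN-T does.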

The results (Figure 2) show that PJSD predicts 
significantly higher scores than %-IN-T for all phe- 
nomena, especially for answers, reformulations, 


collaborative completions and acknowledgments. 
For repetition, δ is quite small, but still significant
due to the large sample size. These findings corrob- 
orate our hypothesis that %-IN-T and PJSD capture 
repetition similarly, but PJSD is able to better cap- 
ture other uptake phenomena. 


6 Downstream Application 


To test the generalizability of our uptake measures 
and their link to instruction quality, we correlate 
PJSD and %-IN-T with educational outcomes on 
three different datasets of student-teacher interac- 
tions (Table 6). 


NCTE dataset. We use all transcripts from the 
NCTE dataset (Kane et al., 2015) (Section 3) 
with associated classroom observation scores based 
on the MQI coding instrument (Teaching Project, 
2011). We select two items from MQI relevant to 
uptake as outcomes: (1) use of student math contri- 
butions and (2) overall quality of math instruction. 
Since these items are coded at a 7-minute segment- 
level, we take the average ratings across raters and 
segments for each transcript. 


Tutoring dataset. We use data from an educa- 
tional technology company (same as in Chen et al., 
2019), which provides on-demand text-based tu- 
toring for math and science. With a mobile appli- 
cation, a student can take a picture of a problem 


Dataset     Size                    Genre              Topic          Class size   Outcome                        PJSD (β)   %-IN-T (β)
NCTE        1.6k conv., 55k (S,T)   in-person, spoken  math           whole class  use of student contributions   .101***    .113***
                                                                                   math instruction quality       .091***    .121***
SimTeacher  338 conv., 5k (S,T)     virtual, spoken    literature     small group  quality of feedback            .127       .123
Tutoring    4.6k conv., 85k (S,T)   virtual, written   math, science  one-on-one   student satisfaction           .069***    .008
                                                                                   external reviewer rating       .063***    .021


Table 6: The correlation of uptake scores from PJSD and %-IN-T and outcomes for three educational datasets. The 
β values represent z-scored coefficients, each obtained from an ordinary least squares regression, controlling for
the number of (S, T) pairs with uptake scores in each conversation (*: p < 0.05, **: p < 0.01, ***: p < 0.001).


or write it down, and is then connected to a pro- 
fessional tutor who guides the student to solve the 
problem. Similarly to Chen et al. (2019), we filter 
out short sessions where the tutors are unlikely to 
deliver meaningful tutoring. Specifically, we create 
a list of (S, T) pairs for all sessions, keeping pairs
where S ≥ 5 tokens, and then remove sessions with
fewer than ten (S, T) pairs. This results in 4604
sessions, representing 108 tutors and 1821 students. 
Each session is associated with two outcome mea- 
sures: (1) student satisfaction scores (1-5 scale) 
and (2) a rating by the tutor manager based on an 
evaluation rubric (0-1 scale). 


SimTeacher dataset. We use a dataset collected 
by Cohen et al. (2020), via a mixed reality sim- 
ulation platform in which novice teachers get to 
practice key classroom skills in a virtual classroom 
interface populated by student avatars. The avatars 
are controlled remotely by a trained actor; hence 
the term “mixed” reality. All pre-service teach- 
ers from a large public university complete a five- 
minute simulation session at multiple timepoints in 
their teacher preparation program, and are coached 
on how to better elicit students’ thinking about a 
text. We use data from Fall 2019, with 338 sessions 
representing 117 teachers. Since all sessions are 
based on the same scenario (discussed text, lead- 
ing questions, avatar scripts), this dataset uniquely 
allows us to answer the question: controlling for 
student avatar scripts, does greater teacher uptake
lead to better outcomes? For the outcome variable,
we use their holistic “quality of feedback” measure
(1-10 scale), annotated at the transcript level by the
original research team.


This overall quality scale accounts for the extent to which 
teachers actively work to support student avatars’ develop- 
ment of text-based responses, highlighting the importance of 
probing student responses (e.g. “Where in the text did you see
that?”; “What made you think this about the character?”).


6.1 Results & Analysis 


As outcomes are linked to conversations, we first 
mean-aggregate uptake scores to the conversation- 
level. We then compute the correlation of up- 
take scores and outcomes using an ordinary least 
squares regression, controlling for the number of 
(S, T) pairs in each conversation.
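
The aggregation and regression steps can be illustrated with a short sketch. It assumes that "z-scored coefficients" means that the outcome, the uptake score and the control are all standardized before fitting; under that assumption, an ordinary least squares fit with two predictors has a closed form in terms of pairwise Pearson correlations. The function names and the input format are hypothetical, and significance testing is left out.

```python
from collections import defaultdict
from statistics import mean, pstdev


def aggregate(pair_scores):
    """Mean-aggregate (conversation_id, uptake_score) rows per conversation.

    Returns {conversation_id: (mean_uptake, n_pairs)}.
    """
    grouped = defaultdict(list)
    for conv_id, score in pair_scores:
        grouped[conv_id].append(score)
    return {cid: (mean(s), len(s)) for cid, s in grouped.items()}


def zscore(values):
    """Standardize to mean 0 and (population) standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def standardized_beta(outcome, uptake, n_pairs):
    """OLS coefficient on uptake, controlling for n_pairs.

    All three variables are z-scored, so the coefficient can be written
    in terms of the three pairwise Pearson correlations.
    """
    zy, zx, zc = zscore(outcome), zscore(uptake), zscore(n_pairs)
    n = len(zy)
    r_yx = sum(a * b for a, b in zip(zy, zx)) / n  # outcome vs. uptake
    r_yc = sum(a * b for a, b in zip(zy, zc)) / n  # outcome vs. control
    r_xc = sum(a * b for a, b in zip(zx, zc)) / n  # uptake vs. control
    return (r_yx - r_yc * r_xc) / (1 - r_xc ** 2)
```

In practice, a regression library would also report the standard errors and p-values needed for the significance markers; the closed form is used here only to keep the example self-contained.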


The results (Table 6) indicate that PJSD cor- 
relates with all of the outcome measures signifi- 
cantly. %-IN-T also shows significant correlations 
for NCTE and for SimTeacher, but not for the tu- 
toring dataset. We provide more details below. 


For NCTE and SimTeacher, we find that the two
measures show similar positive correlations with
outcomes. These results provide further insight into
our earlier findings from Section 5.2. They suggest 
that the teacher’s repetition of student words, also 
known as “revoicing” in math education (Forman 
et al., 1997; O’Connor and Michaels, 1993), may 
be an especially important mediator of instruction
quality in classroom contexts, and that other aspects
of uptake are relatively less important. The
significant correlation of PJSD with the outcome in the
case of SimTeacher is especially noteworthy because
PJSD was not fine-tuned on this dataset (Section 5.1);
this provides evidence for the adaptability of a
pre-trained model to other (similar) datasets.


High student feedback (%-IN-T < PJSD)

S: if they’re the same length i think
T: that’s right! all we need is the length, and that’s enough.
S: the energy from the one pendulum moving will transfer the same frequency to the second pendulum once they touch?
T: they don’t even need to touch! we can swing them so they swing side by side, like two swings on a swingset.
S: pendulum one will start to absorb energy from pendulum two?
T: exactly! and eventually, the whole process will reverse until pendulum one is moving full speed again.

Low student feedback (PJSD < %-IN-T)

S: when you are saying mixture are you talking about nitrogen?
T: thanks for your question.
S: no i don’t think so
T: great answer!
S: i don’t know , just made an educated guess
T: great try!
S: i want further explanation about volume and number moles when using nitrogen
T: sure. no worries!

Table 7: Examples from the tutoring dataset. For both examples, the predictions by PJSD are more accurate than
those by %-IN-T, which predicts values that are too low and too high, respectively, when compared to student ratings.


The gap between the two measures in the case of the
tutoring dataset is an interesting finding, possibly
explained by the conversational setting: repetition may
be an effective uptake strategy in multi-participant,
spoken settings, ensuring that everyone has heard what
the student said and is on the same page, whereas in a
written 1:1 teaching setting, repetition may not be
necessary or effective, as both participants are likely
to assume that their interlocutor has read their words.
Our qualitative analysis suggests that PJSD might be
outperforming %-IN-T because it is better able to pick
up on cues related to teacher responsiveness (we
include two examples in Table 7). To test this, we
detect coarse-grained estimates of teacher uptake:
teacher question marks (an estimate of follow-up
questions) and teacher exclamation marks (an estimate
of approval). We then follow the same procedure as in
Section 5.2 and find that dialogs where PJSD
outperforms %-IN-T, in terms of predicting student
ratings, have a higher ratio of exchanges with teacher
questions (p < 0.05, obtained from a two-sample
t-test) and teacher exclamation marks (p < 0.01).
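
The coarse-grained check above can be sketched in a few lines. The helper names are hypothetical, and the choice of Welch's unequal-variance form of the two-sample t statistic is an assumption, since the text does not say which variant was used. The sketch stops at the t statistic; a p-value would come from a statistics library (for example `scipy.stats.ttest_ind`).

```python
from math import sqrt
from statistics import mean, variance


def mark_ratio(tutor_utterances, mark):
    """Share of exchanges whose tutor utterance contains `mark` ("?" or "!")."""
    return sum(mark in t for t in tutor_utterances) / len(tutor_utterances)


def welch_t(a, b):
    """Two-sample t statistic (Welch's form, an assumption of this sketch).

    `a` and `b` hold one ratio per dialog, e.g. for dialogs where PJSD
    predicts the student rating better and for the remaining dialogs.
    """
    var_a = variance(a) / len(a)  # squared standard error of mean(a)
    var_b = variance(b) / len(b)  # squared standard error of mean(b)
    return (mean(a) - mean(b)) / sqrt(var_a + var_b)
```

For instance, `mark_ratio` applied with `"!"` to the three tutor turns "great try!", "sure. no worries!" and "thanks for your question." gives 2/3, since two of the three turns contain an exclamation mark.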

To put the effect sizes from Table 6 (where
significant) in context: among education interventions
that are designed to improve student outcomes
(typically test scores), coefficients of the size we
report here are considered average for an effective
educational intervention (Kraft, 2020). Further,
existing guidelines for educational interventions would classify
uptake as a promising potential intervention, as it 
is highly scalable and easily quantified. 


7 Related Work 


Prior computational work on classroom discourse 
has employed supervised, feature-based classifiers 
to detect teachers’ discourse moves relevant to stu- 
dent learning, such as authentic questions, elabo- 
rated feedback and uptake, treating these moves as 
binary variables (Samei et al., 2014; Donnelly et al., 
2017; Kelly et al., 2018; Stone et al., 2019; Jensen 
et al., 2020). Our labeled dataset, unsupervised 
approach (involving a state-of-the-art pre-trained
model), and careful analysis across domains are 
novel contributions that will enable a fine-grained 
and domain-adaptable measure of uptake that can 
support researchers and teachers. 

Our work aligns closely with research on the 
computational study of conversations. For example, 
measures have been developed to study constructiveness
(Niculae and Danescu-Niculescu-Mizil, 2016), politeness
(Danescu-Niculescu-Mizil et al., 2013) and persuasion
(Tan et al., 2016) in conversations.
Perhaps most similar to our work, Zhang and
Danescu-Niculescu-Mizil (2020) develop an unsu- 
pervised method to identify therapists’ backward- 
and forward-looking utterances, with which they 
guide their conversations. 

We also draw on work measuring discourse co- 
herence via embedding cosines (Xu et al., 2018; 
Ko et al., 2019), or via utterance classification (Xu 
et al., 2019; Iter et al., 2020), the latter of which 
is also used for building and evaluating dialog systems
(Lowe et al., 2016; Wolf et al., 2019). Our
work extends these two families of methods to hu- 
man conversation and highlights the different lin- 
guistic phenomena they capture. Finally, our work 
shows the key role of coherence in the socially 
important task of studying uptake. 


8 Conclusion 


We propose a framework for measuring uptake, a 
core conversational phenomenon with particularly 
high relevance in teaching contexts. We release an 
annotated dataset and develop and compare unsu- 
pervised measures of uptake, demonstrating signif- 
icant correlation with educational outcomes across 
three datasets. This lays the groundwork (1) for 
scaling up teachers’ professional development on
uptake, thereby enabling improvements to education,
(2) for conducting analyses on uptake across
domains and languages where labeled data does 
not exist and (3) for studying the effect of uptake 
on a wider range of socially relevant outcomes. 


9 Ethical Considerations


Our objective in building a dataset and a frame- 
work for measuring uptake is (1) to aid researchers 
studying conversations and teaching and (2) to (ulti- 
mately) support the professional development of ed- 
ucators by providing them with a scalable measure 
of a phenomenon that supports student learning. 
Our second objective is especially important, since 
existing forms of professional development aimed 
at improving uptake are highly resource intensive 
(involving classroom observations and manual eval- 
uation). This costliness has meant that teachers 
working in under-resourced school systems have 
thus far had limited access to quality professional 
development in this area. 

The dataset we release is sampled from tran- 
scripts collected by the National Center for Teacher 
Effectiveness (NCTE) (Kane et al., 2015) (Har- 
vard IRB #17768). These transcripts represent data 
from 317 teachers across 4 school districts in New 
England that serve largely low-income, historically 
marginalized students. The data was collected as 
part of a carefully designed study on teacher ef- 
fectiveness, spanning three years between 2010 
and 2013, and it was de-identified by the original
research team, meaning that in the transcripts, student
names are replaced with “Student” and teacher
names are replaced with “Teacher”. Both parents
and teachers gave consent for the de-identified 
data to be retained and used in future research. 
The collection process and representativeness of
the data are described in great detail in Kane
et al. (2015). Given that the dataset was collected a
decade ago, there may be limitations to its use and 
ongoing relevance. That said, research in education 
reform has long attested to the fact that teaching 
practices have remained relatively constant over 
the past century (Cuban, 1993; Cohen and Mehta, 
2017) and that there are strong socio-cultural pres- 
sures that maintain this (Cohen, 1988). 

The data was annotated by 13 raters, whose
demographics are largely representative of teacher
demographics in the US
(https://nces.ed.gov/fastfacts/display.asp?id=28).
All raters have domain expertise, in that they are
former or current math teachers and former or current
raters for the Mathematical Quality of Instruction
(Teaching Project, 2011). The raters were trained for
at least an hour each on the coding instrument, spent
8 hours on average on the annotation (over the course
of several weeks), and were compensated $16.50/hr.

In Section 6, we apply our measures to two educational
datasets besides NCTE. We do not release
either of these datasets. The SimTeacher dataset 
was collected by Cohen et al. (2020) (University of 
Virginia IRB #2918), for research and program im- 
provement purposes. The participants in the study 
are mostly white (82%), female (90%), and middle 
class (71%), mirroring the broader teaching profes- 
sion. As for the tutoring dataset, the data belongs 
to a private company; the students and tutors have 
given consent for their data to be used for research, 
with the goal of improving the company’s services. 
The company works with a large number of tutors 
and students; we use data that represents 108 tutors 
and 1821 students. 70% of tutors in the data are 
male, complementing the other datasets where the 
majority of teachers are female. The company does 
not share other demographic information about tu- 
tors and students. 

Similarly to other data-driven approaches, it is 
important to think carefully about the source of 
the training data when considering downstream use 
cases of our measure. Our unsupervised approach 
helps address this issue as it allows for training the 
model on data that is representative of the popula- 
tion that it is meant to serve. 


A Annotation Framework


Figure 3 shows a screenshot of our annotation in- 
terface. In the annotation framework, we used the 
term “active listening” to refer to uptake, since we 
found that active listening is more interpretable 
to raters, while uptake is too technical. However, 
the difference in terminology should not affect the 
annotations, since the two constructs are synony- 
mous and we designed the annotation instructions 
entirely based on the linguistics and education lit- 
erature on uptake. For example, the title of the in- 
struction manual is “Annotating Teachers’ Uptake 
of Student Ideas”, and we define different levels of 
uptake with phrasings such as “the teacher provides 
evidence for following what the student is saying 
or trying to say”, linking our definition to Clark
and Schaefer’s (1989) theory on grounding. We
include annotation instructions with the dataset. 


Coding Instructions

Lesson topic: Solving word problems

Student: At Miss C's Confection's you can order two kinds of cakes, chocolate or vanilla. You
can choose from five different frosting flavors for your cake: fudge, banana,
strawberry, vanilla, or lemon. How many different kinds of cake combinations could
you order if you choose one cake and one frosting?

Teacher: Oh, my goodness. Those are one of those doozies, right? Well let's see how we do it.
At Miss C's Confections, you could order two kinds of cakes: chocolate or vanilla.
See how I'm visualizing? Right, chocolate or vanilla. You can choose from five
different frosting flavors for your cake. What are the five flavors? Who can help me?

1. Validity
If any of the conditions below is not met, you can stop coding the example.
Student utterance relates to mathematics.
Teacher utterance relates to mathematics.

2. Display of Active Listening
To what degree does the teacher show that they are listening to the student’s idea?
( ) Low   ( ) Mid   ( ) High

4. Comments?
Optional, only add if necessary.


Figure 3: Screenshot of the annotation interface.


Model          ρ
PJSD           .540
RoBERTa-base   .561
BERT-base      .618

Table 8: Supervised model results.


B Supervised Model Results 


We conducted experiments to compare the perfor- 
mance of our unsupervised models to that of su- 
pervised models. We randomly split the annotated 
data into training (80%) and test (20%) sets, using 
the z-scored rater judgments as labels (Section 3). 
We trained BERT-base (Devlin et al., 2019) and 
RoBERTa-base (Liu et al., 2019) on this data for 
10 epochs with early stopping, and a batch size 
of 8 × 2 gradient accumulation steps; all other
parameters are defaults set by Huggingface*.

The results are shown in Table 8. The supervised 
models outperform our unsupervised models by 
less than .08, indicating the competitiveness of our 
unsupervised methods. Interestingly, we also find 
that BERT outperforms RoBERTa, a gap that per- 
sisted despite tuning the number of training epochs. 
Since our paper’s focus is unsupervised methods 
that enable scalability and domain-generalizability, 
we leave more extensive parameter search and su- 
pervised model comparison for future work. 


C Mapping the SWBD-DAMSL Tagset to Uptake Phenomena


We map tags from SWBD-DAMSL (Jurafsky et al., 
1997) to five salient uptake phenomena: acknowl- 
edgment, answer, reformulation, collaborative com- 
pletion and repetition. Table 9 summarizes our 
mapping. Since acknowledgment is highly frequent
and can co-occur with several other dialog acts,
we count an example as an acknowledgment only if
it is labeled exclusively for this phenomenon
(with the tag b, bh or bk).


* https://huggingface.co/


Uptake phenomenon          DAMSL tags            % of examples
acknowledgment             b, bh, bk             81%
answer                     tags containing “n”   13%
reformulation              bf                    2%
collaborative completion   ^2                    2%
repetition                 ^m                    2%


Table 9: Mapping between uptake phenomena and tags from SWBD-DAMSL (Jurafsky et al., 1997).
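
The mapping can be expressed as a small lookup function. This is a sketch under stated assumptions: it takes each utterance's label as a single raw tag string (such as "ny" or "b^m"), uses "^2" and "^m" as the SWBD-DAMSL tags for collaborative completion and repeat-phrase, and the function name `uptake_phenomena` is hypothetical. How combined tags are stored in a given corpus release may differ.

```python
ACK_TAGS = {"b", "bh", "bk"}  # tags counted as acknowledgment


def uptake_phenomena(tag):
    """Map one SWBD-DAMSL tag string to a set of uptake phenomena."""
    # Acknowledgment counts only when the example is labeled
    # exclusively with one of the acknowledgment tags.
    if tag in ACK_TAGS:
        return {"acknowledgment"}
    found = set()
    if "n" in tag:              # answer: any tag containing "n"
        found.add("answer")
    if tag.startswith("bf"):    # reformulation
        found.add("reformulation")
    if "^2" in tag:             # collaborative completion
        found.add("collaborative completion")
    if "^m" in tag:             # repetition
        found.add("repetition")
    return found
```

Under these assumptions, a combined label such as "b^m" is counted as repetition but not as acknowledgment, which mirrors the exclusivity rule described in the text.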


